This notebook is an exam hand-in for the course 02467 Computational Social Science at the Technical University of Denmark (DTU), Spring 2022 semester.
Group members:
All members contributed completely equally.
We had different main responsibilities:
! Notice that the analysis of networks and text is presented on the website !
# Import all necessary libraries
import json
import numpy as np
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup
from collections import Counter
import matplotlib as mpl
import matplotlib.pyplot as plt
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from itertools import combinations
import networkx as nx
import netwulf as nw
import community.community_louvain as community_louvain  # python-louvain; avoids clashing with the unrelated 'community' package
from wordcloud import WordCloud, STOPWORDS
from ast import literal_eval
import shifterator as sh
# Define colors for books
colors = ['red', 'green', 'magenta', 'blue', 'purple', 'cyan', 'orange']
All three of us are hard-core Harry Potter fans, so we saw this as the perfect opportunity to put all of our Harry Potter fun facts to use. Our main dataset consists of Harry Potter book summaries at chapter level. In addition, we use a dataset that contains character information, which makes it easy to identify characters in the summaries. We chose our main dataset because we wanted to look at the Harry Potter books, which are much more extensive than the movies, but we were not able to use the books directly due to copyright. Fandom (previously known as Wikia) is a service that hosts wikis mainly on entertainment (i.e. books, movies, and TV shows) — essentially a high-quality Wikipedia for entertainment — which is why we web-scraped the summaries from there.
Our mission is to do a deep dive into the realm of Harry Potter and examine the development across the seven books. Firstly, we use social networks to look at the characters: which characters play the most essential roles, which are connected, and what do they have in common if we partition them? Secondly, we use text analysis tools to look at topics and themes: what are the most important topics? Furthermore, we have a theory that the books become darker, more sinister, and more gloomy over time, which we hope to detect via sentiment analysis. We hope that this will be fun for people who are as passionate about Harry Potter as we are.
Process and methods:
# All links start the same
root = 'https://harrypotter.fandom.com/wiki/Harry_Potter_and_the_'
# End of link for the specific books
endpaths = ['Philosopher%27s_Stone', 'Chamber_of_Secrets', 'Prisoner_of_Azkaban', 'Goblet_of_Fire', 'Order_of_the_Phoenix', 'Half-Blood_Prince', 'Deathly_Hallows']
# Use list comprehension to create a list of urls
urls = [root + endpath for endpath in endpaths]
def GetData(urls):
    """
    Web-scrape and collect all relevant information in a dataframe.
    """
    # Create empty lists to collect book number, chapter number, chapter title and chapter summary
    books = []
    chapter_numbers = []
    chapter_titles = []
    summaries = []
    # Scrape each book
    for book, url in enumerate(urls):
        # Create soup
        req = requests.get(url)
        soup = BeautifulSoup(req.content, 'html.parser')
        # Find headlines
        headers = soup.find_all(['h3'])
        for head in headers:
            if head.span is not None:
                head.span.unwrap()
        # Convert chapter titles to str
        chapter_headers = [head.text for head in headers if head.text[:7] == 'Chapter']
        # Update chapter numbers
        chapter_numbers.extend([i + 1 for i in range(len(chapter_headers))])
        # Update chapter names
        chapter_titles.extend([title[11:] for title in chapter_headers])
        # Update book number
        books.extend([book + 1 for i in range(len(chapter_headers))])
        # Retrieve the relevant text.
        # When book summaries end,
        # they are always followed by either "List of spells first introduced" or "List of deaths"
        HeadlineNames = chapter_headers + ['List of spells first introduced', 'List of deaths']
        # Unwrap text and headlines
        Texts = soup.find_all(['p', 'h3', 'h2'])
        for section in Texts:
            if section is not None:
                if section.a is not None:
                    section.a.unwrap()
                elif section.span is not None:
                    section.span.unwrap()
        # Store text in a list
        Texts = [t.text for t in Texts]
        # Find section indices in Texts
        CaptionIndices = []
        for caption in HeadlineNames:
            if caption in Texts:
                CaptionIndices.append(Texts.index(caption))
        CaptionIndices = CaptionIndices[:len(chapter_headers) + 1]
        # Store texts according to section
        SectionText = []
        for i in range(len(CaptionIndices) - 1):
            SectionText.append(Texts[CaptionIndices[i] + 1:CaptionIndices[i + 1]])
        summaries.extend(SectionText)
    # Fill dataframe
    df = pd.DataFrame({"book": books, "chapter_number": chapter_numbers,
                       "chapter_title": chapter_titles, "summary": summaries})
    return df
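The header-unwrapping step inside `GetData` can be exercised in isolation. The HTML fragment below is hypothetical, mirroring the `<h3><span>` headline markup the scraper expects on the Fandom pages:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment in the style of a Fandom chapter headline
html = '<h3><span class="mw-headline">Chapter 1: The Boy Who Lived</span></h3>'
soup = BeautifulSoup(html, 'html.parser')

head = soup.find('h3')
if head.span is not None:
    head.span.unwrap()  # lift the headline text out of the <span> wrapper

print(head.text)  # Chapter 1: The Boy Who Lived
```

After `unwrap()`, the `<span>` tag is gone and the chapter title sits directly inside the `<h3>`, which is what makes the later `head.text[:7] == 'Chapter'` filter reliable.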
# Dataframe with summary information
df = GetData(urls)
df
| | book | chapter_number | chapter_title | summary |
|---|---|---|---|---|
| 0 | 1 | 1 | The Boy Who Lived | [Vernon and Petunia Dursley, of Number Four P... |
| 1 | 1 | 2 | The Vanishing Glass | [Dudley counting his presents, Ten years pass ... |
| 2 | 1 | 3 | The Letters from No One | [Hundreds of letters arriving at the fireplace... |
| 3 | 1 | 4 | The Keeper of the Keys | [Rubeus Hagrid enters the cabin, There is anot... |
| 4 | 1 | 5 | Diagon Alley | [Ollivander's Wand Shop, When Harry wakes the ... |
| ... | ... | ... | ... | ... |
| 193 | 7 | 32 | The Elder Wand | [Voldemort and the Elder Wand, Harry, Hermione... |
| 194 | 7 | 33 | The Prince's Tale | [Snape's memories, Harry dives into Snape's me... |
| 195 | 7 | 34 | The Forest Again | [Harry's mother comforting him before his "dea... |
| 196 | 7 | 35 | King's Cross | [Harry during his "death", However, Harry find... |
| 197 | 7 | 36 | The Flaw in the Plan | [Hagrid carrying Harry's body, Back in the for... |
198 rows × 4 columns
Process and methods:
# Load file with characters and their attributes as dataframe
characters = pd.read_json('characters.json')
# Remove pictures of the characters from the dataframe
characters.pop('image')
# Dataframe with character information
characters
| | name | alternate_names | species | gender | house | dateOfBirth | yearOfBirth | wizard | ancestry | eyeColour | hairColour | wand | patronus | hogwartsStudent | hogwartsStaff | actor | alternate_actors | alive |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Harry Potter | [] | human | male | Gryffindor | 31-07-1980 | 1980 | True | half-blood | green | black | {'wood': 'holly', 'core': 'phoenix feather', '... | stag | True | False | Daniel Radcliffe | [] | True |
| 1 | Hermione Granger | [] | human | female | Gryffindor | 19-09-1979 | 1979 | True | muggleborn | brown | brown | {'wood': 'vine', 'core': 'dragon heartstring',... | otter | True | False | Emma Watson | [] | True |
| 2 | Ron Weasley | [Dragomir Despard] | human | male | Gryffindor | 01-03-1980 | 1980 | True | pure-blood | blue | red | {'wood': 'willow', 'core': 'unicorn tail-hair'... | Jack Russell terrier | True | False | Rupert Grint | [] | True |
| 3 | Draco Malfoy | [] | human | male | Slytherin | 05-06-1980 | 1980 | True | pure-blood | grey | blonde | {'wood': 'hawthorn', 'core': 'unicorn tail-hai... | True | False | Tom Felton | [] | True | |
| 4 | Minerva McGonagall | [] | human | female | Gryffindor | 04-10-1925 | 1925 | True | black | {'wood': '', 'core': '', 'length': ''} | tabby cat | False | True | Dame Maggie Smith | [] | True | ||
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 398 | Albus Severus Potter | [Al] | human | male | Slytherin | True | half-blood | green | black | {'wood': '', 'core': '', 'length': ''} | True | False | Arthur Bowen | [] | True | |||
| 399 | Rose Weasley | [] | human | female | Gryffindor | True | half-blood | red | {'wood': '', 'core': '', 'length': ''} | True | False | Helena Barlow | [] | True | ||||
| 400 | Hugo Weasley | [] | human | male | True | half-blood | brown | {'wood': '', 'core': '', 'length': ''} | True | False | Ryan Turner | [] | True | |||||
| 401 | Scorpius Malfoy | [Scorpius Hyperion Malfoy] | human | male | Slytherin | True | pure-blood | grey | blond | {'wood': '', 'core': '', 'length': ''} | True | False | Bertie Gilbert | [] | True | |||
| 402 | Victoire Weasley | [] | human | female | True | blonde | {'wood': '', 'core': '', 'length': ''} | True | False | [] | True |
403 rows × 18 columns
def FirstLastNames(names):
    """
    Find first and last name (if possible) for all characters.
    """
    # Make empty lists to contain the names
    first_names = []
    last_names = []
    # Go through all names
    for name in names:
        # If there is a space in the name, we have both a first and a last name
        if ' ' in name:
            split = name.split(' ')
            # Add first name
            first_names.append(split[0])
            # Join the remaining parts into the last name (handles characters with several last names)
            last_names.append(' '.join(split[1:]))
        # If no last name is known, add 'NKLN' (No Known Last Name)
        else:
            first_names.append(name)
            last_names.append('NKLN')
    return first_names, last_names
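The same first/last split can be sketched standalone with `str.partition`, which splits on the first space in one call (toy names, not read from the character file):

```python
def split_name(full_name):
    """Split a full name into (first, last); 'NKLN' marks a missing last name."""
    first, _, last = full_name.partition(' ')
    return first, (last if last else 'NKLN')

print(split_name('Harry Potter'))          # ('Harry', 'Potter')
print(split_name('Albus Severus Potter'))  # ('Albus', 'Severus Potter')
print(split_name('Dobby'))                 # ('Dobby', 'NKLN')
```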
def NickNames(alternate_name):
    """
    Fill column with NKAN if a character has No Known Alternate Name.
    """
    names = []
    for name in alternate_name:
        # If no alternate name is known, insert NKAN (No Known Alternate Name)
        if name == []:
            names.append('NKAN')
        # Otherwise add the alternate name as a str instead of a list
        else:
            names.append(name[0])
    return names
def HogwartsHouse(houses):
    """
    Fill column with NKH if a character has No Known House.
    """
    # If house is not assigned, replace empty str with NKH (No Known House)
    fill_houses = ["NKH" if x == '' else x for x in houses]
    return fill_houses
def SpellNameCorrectly(characters, incorrect_name, correct_name):
    """
    Correct the name of a character if spelled incorrectly.
    """
    # replace is not in-place by default, so the result would otherwise be discarded
    characters.replace(incorrect_name, correct_name, inplace=True)
def CleanCharactersDataframe(characters):
    """
    Apply FirstLastNames, NickNames, HogwartsHouse and SpellNameCorrectly to the dataframe.
    """
    # We noticed that Quirinus Quirrell was spelled differently in the two files
    SpellNameCorrectly(characters, 'Quirinus Quirrel', 'Quirinus Quirrell')
    # Add columns with first and last name to the dataframe
    characters['first_names'], characters['last_names'] = FirstLastNames(list(characters['name']))
    characters.insert(1, 'first_names', characters.pop('first_names'))
    characters.insert(2, 'last_names', characters.pop('last_names'))
    # Add column with alternate names
    characters['alternate_names'] = NickNames(list(characters['alternate_names']))
    # Add column with house
    characters['house'] = HogwartsHouse(list(characters['house']))
    return characters
characters = CleanCharactersDataframe(characters)
# Dataframe with cleaned character information
characters
| | name | first_names | last_names | alternate_names | species | gender | house | dateOfBirth | yearOfBirth | wizard | ancestry | eyeColour | hairColour | wand | patronus | hogwartsStudent | hogwartsStaff | actor | alternate_actors | alive |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Harry Potter | Harry | Potter | NKAN | human | male | Gryffindor | 31-07-1980 | 1980 | True | half-blood | green | black | {'wood': 'holly', 'core': 'phoenix feather', '... | stag | True | False | Daniel Radcliffe | [] | True |
| 1 | Hermione Granger | Hermione | Granger | NKAN | human | female | Gryffindor | 19-09-1979 | 1979 | True | muggleborn | brown | brown | {'wood': 'vine', 'core': 'dragon heartstring',... | otter | True | False | Emma Watson | [] | True |
| 2 | Ron Weasley | Ron | Weasley | Dragomir Despard | human | male | Gryffindor | 01-03-1980 | 1980 | True | pure-blood | blue | red | {'wood': 'willow', 'core': 'unicorn tail-hair'... | Jack Russell terrier | True | False | Rupert Grint | [] | True |
| 3 | Draco Malfoy | Draco | Malfoy | NKAN | human | male | Slytherin | 05-06-1980 | 1980 | True | pure-blood | grey | blonde | {'wood': 'hawthorn', 'core': 'unicorn tail-hai... | True | False | Tom Felton | [] | True | |
| 4 | Minerva McGonagall | Minerva | McGonagall | NKAN | human | female | Gryffindor | 04-10-1925 | 1925 | True | black | {'wood': '', 'core': '', 'length': ''} | tabby cat | False | True | Dame Maggie Smith | [] | True | ||
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 398 | Albus Severus Potter | Albus | Severus Potter | Al | human | male | Slytherin | True | half-blood | green | black | {'wood': '', 'core': '', 'length': ''} | True | False | Arthur Bowen | [] | True | |||
| 399 | Rose Weasley | Rose | Weasley | NKAN | human | female | Gryffindor | True | half-blood | red | {'wood': '', 'core': '', 'length': ''} | True | False | Helena Barlow | [] | True | ||||
| 400 | Hugo Weasley | Hugo | Weasley | NKAN | human | male | NKH | True | half-blood | brown | {'wood': '', 'core': '', 'length': ''} | True | False | Ryan Turner | [] | True | ||||
| 401 | Scorpius Malfoy | Scorpius | Malfoy | Scorpius Hyperion Malfoy | human | male | Slytherin | True | pure-blood | grey | blond | {'wood': '', 'core': '', 'length': ''} | True | False | Bertie Gilbert | [] | True | |||
| 402 | Victoire Weasley | Victoire | Weasley | NKAN | human | female | NKH | True | blonde | {'wood': '', 'core': '', 'length': ''} | True | False | [] | True |
403 rows × 20 columns
Process and methods:
def CleanSummaries(df):
    """
    Clean the summaries.
    """
    # Create list for cleaned summaries
    summary = []
    for i in range(len(df)):
        text = df['summary'].iloc[i]
        # Remove newlines
        text = [t.strip() for t in text]
        # Split into non-sign tokens
        tokens = []
        for t in text:
            tokens = tokens + re.split(r'\W', t)
        # Join tokens
        text = ' '.join(tokens)
        # Drop possessive/contraction leftovers and collapse double spaces
        text = text.replace(' s ', ' ')
        text = text.replace(' t ', ' ')
        text = text.replace('  ', ' ')
        # Append cleaned text to summary
        summary.append(text)
    return summary
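The heart of `CleanSummaries` is the `re.split(r'\W', ...)` tokenisation plus removal of the stray `s`/`t` fragments that possessives and contractions leave behind. A minimal standalone sketch of that pipeline on a made-up sentence:

```python
import re

def clean_text(text):
    """Split on non-word characters, drop possessive/contraction leftovers, collapse spaces."""
    tokens = re.split(r'\W', text)
    out = ' '.join(tokens)
    out = out.replace(' s ', ' ').replace(' t ', ' ')
    return ' '.join(out.split())  # collapse any repeated spaces

print(clean_text("Ollivander's Wand Shop, where Harry doesn't linger."))
# Ollivander Wand Shop where Harry doesn linger
```

Note the trade-off visible in the output: stripping the `'s` fragments also truncates contractions like "doesn't", which is acceptable here since the tokens are only used for name matching and word counts.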
%%capture
# Only run once - running it again would re-tokenize the already-cleaned summaries
df['summary'] = CleanSummaries(df)
df['plot_tokens'] = [row['summary'].split(" ") for _,row in df.iterrows()]
# save df to csv file
df.to_csv('plot_summary_df.csv')
# Dataframe with cleaned and tokenized plot summaries
df
| | book | chapter_number | chapter_title | summary | plot_tokens |
|---|---|---|---|---|---|
| 0 | 1 | 1 | The Boy Who Lived | Vernon and Petunia Dursley of Number Four Priv... | [Vernon, and, Petunia, Dursley, of, Number, Fo... |
| 1 | 1 | 2 | The Vanishing Glass | Dudley counting his presents Ten years pass si... | [Dudley, counting, his, presents, Ten, years, ... |
| 2 | 1 | 3 | The Letters from No One | Hundreds of letters arriving at the fireplace ... | [Hundreds, of, letters, arriving, at, the, fir... |
| 3 | 1 | 4 | The Keeper of the Keys | Rubeus Hagrid enters the cabin There is anothe... | [Rubeus, Hagrid, enters, the, cabin, There, is... |
| 4 | 1 | 5 | Diagon Alley | Ollivander Wand Shop When Harry wakes the next... | [Ollivander, Wand, Shop, When, Harry, wakes, t... |
| ... | ... | ... | ... | ... | ... |
| 193 | 7 | 32 | The Elder Wand | Voldemort and the Elder Wand Harry Hermione an... | [Voldemort, and, the, Elder, Wand, Harry, Herm... |
| 194 | 7 | 33 | The Prince's Tale | Snape memories Harry dives into Snape memories... | [Snape, memories, Harry, dives, into, Snape, m... |
| 195 | 7 | 34 | The Forest Again | Harry mother comforting him before his death H... | [Harry, mother, comforting, him, before, his, ... |
| 196 | 7 | 35 | King's Cross | Harry during his death However Harry finds him... | [Harry, during, his, death, However, Harry, fi... |
| 197 | 7 | 36 | The Flaw in the Plan | Hagrid carrying Harry body Back in the forest ... | [Hagrid, carrying, Harry, body, Back, in, the,... |
198 rows × 5 columns
Notice that most analysis can be found on the website and not in this Notebook!
Process and methods:
def FindConnectedCharacters(df, name, first_names, last_names, alternate_names):
    """
    Detect connected characters.
    """
    # Prefixes that identify characters with common prefixes
    exceptions = ['Mrs', 'Mr', 'Fat', 'Nearly', 'Sir', 'The', 'Bloody', 'Moaning', 'Dr', 'Madam', 'Wizard']
    # Some locations contain names that must not be confused with the character
    places = ['Godric Hollow', 'Slytherin Chamber', 'Myrtle Bathroom', 'Malfoy Manor', 'Weasly Burrow', 'Hagrid Hut']
    # The houses also share names with the last name of a character
    houses = ['Gryffindor', 'Hufflepuff', 'Ravenclaw', 'Slytherin']
    # Get connections
    Connections = {}
    for _, row in df.iterrows():  # all texts
        TextEdges = []
        mentioned = []
        for i in range(len(first_names)):
            all_names = [name[i], first_names[i], last_names[i], alternate_names[i]]
            if first_names[i] in exceptions:
                if name[i] in row['summary']:
                    TextEdges.append(name[i])
                else:
                    continue
            elif last_names[i] in houses:
                if first_names[i] in row['summary']:
                    TextEdges.append(name[i])
                else:
                    continue
            else:
                for split_name in all_names:
                    if split_name in row['summary'].split(' '):
                        if split_name in mentioned:
                            break
                        # Tom Riddle is not an individual character and should be associated with Voldemort
                        if name[i] == 'Tom Riddle':
                            TextEdges.append('Lord Voldemort')
                            mentioned.append('Lord Voldemort')
                            mentioned.append('Lord')
                            mentioned.append('Voldemort')
                        else:
                            TextEdges.append(name[i])
                            mentioned.append(name[i])
                            mentioned.append(first_names[i])
                            mentioned.append(last_names[i])
                        break
                    else:
                        continue
        # Store connection
        Connections[(row['book'], row['chapter_number'], row['chapter_title'])] = set(TextEdges)
    return Connections
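Stripped of the prefix/house special cases, the co-mention idea behind `FindConnectedCharacters` reduces to a token-membership test per summary. A toy sketch with hypothetical summaries:

```python
# Hypothetical known names and chapter summaries
known = ['Harry', 'Hermione', 'Ron', 'Draco']

summaries = {
    ('toy', 1): 'Harry and Ron board the train',
    ('toy', 2): 'Hermione helps Harry in the library',
}

# For each summary key, collect the set of known names appearing among its tokens
connections = {
    key: {n for n in known if n in text.split(' ')}
    for key, text in summaries.items()
}

print(connections[('toy', 1)])  # {'Harry', 'Ron'} (set order may vary)
```

Every character set produced this way is treated as a fully connected clique when edges are built later.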
Connections = FindConnectedCharacters(df,characters['name'],characters['first_names'],characters['last_names'],characters['alternate_names'])
# Find unique characters mentioned
unique_characters = list(Connections.values())
unique_characters = [item for sublist in unique_characters for item in sublist]
# Make dataframe where only the mentioned unique characters are in (to minimize mistakes)
characters_p = characters[characters['name'].isin(unique_characters)]
characters_p = characters_p.reset_index(drop=True)
def ConnectionsCounted(df, name, first_name, last_name, alternate_name):
    """
    Count number of times unique characters are mentioned.
    """
    # These prefixes are irrelevant
    exceptions = ['Mrs', 'Mr', 'Fat', 'Nearly', 'Sir', 'The', 'Bloody', 'Moaning', 'Dr', 'Madam', 'Wizard']
    # Some locations contain names that must not be confused with the character
    places = ['Godric Hollow', 'Slytherin Chamber', 'Myrtle Bathroom', 'Malfoy Manor', 'Weasly Burrow', 'Hagrid Hut']
    # The houses also share names with the last name of a character
    houses = ['Gryffindor', 'Hufflepuff', 'Ravenclaw', 'Slytherin']
    # Get connections
    Connections = {}
    for _, row in df.iterrows():
        TextEdges = []
        mentioned = []
        tokens = row['summary'].split(' ')
        for j in range(len(first_name)):
            if first_name[j] in exceptions:
                if name[j] in row['summary']:
                    TextEdges.append([name[j]] * row['summary'].count(name[j]))
                continue
            elif last_name[j] in houses:
                if first_name[j] in row['summary']:
                    TextEdges.append([name[j]] * row['summary'].count(name[j]))
                continue
            for i in range(len(tokens)):
                if tokens[i] == first_name[j]:
                    TextEdges.append([name[j]])
                    mentioned.append(first_name[j])
                    mentioned.append(last_name[j])
                    mentioned.append(name[j])
                if tokens[i] == last_name[j]:
                    # Skip if the first name appeared just before (already counted)
                    if first_name[j] in tokens[i-4:i]:
                        continue
                    if name[j] in mentioned:
                        TextEdges.append([name[j]])
                    elif first_name[j] not in mentioned:
                        if last_name[j] in mentioned:
                            continue
                        TextEdges.append([name[j]])
                        mentioned.append(first_name[j])
                        mentioned.append(last_name[j])
                        mentioned.append(name[j])
        # Store connection
        Connections[(row['book'], row['chapter_number'], row['chapter_title'])] = TextEdges
    return Connections
Connections_count = ConnectionsCounted(df,characters_p['name'],characters_p['first_names'],characters_p['last_names'],characters_p['alternate_names'])
def UnNest(dictionary):
    """
    Unnest dictionary values.
    """
    keys = list(dictionary.keys())
    for key in keys:
        value = dictionary[key]
        dictionary[key] = [item for sublist in value for item in sublist]
    return dictionary
Connections_count = UnNest(Connections_count)
def GetEdges(Connections):
    """
    Compute which characters are connected two-and-two.
    """
    # Get all edges
    Edges = []
    for g in Connections.values():
        Edges += [i for i in combinations(g, 2)]  # For each fully connected subgraph, add all links in that graph
    # Remove double, triple etc. edges
    Edges = [tuple(sorted(edge)) for edge in Edges]
    Edges = set(Edges)
    return Edges
Edges = GetEdges(Connections)
def GetLinkWeights(Edges, Connections):
    """
    Compute number of times two characters co-appear in a summary.
    """
    Weights = {edge: 0 for edge in Edges}
    for edge in Weights.keys():
        w = 0
        for s in Connections.values():
            if (edge[0] in s) and (edge[1] in s):
                w += 1
        Weights[edge] = w
    WeightsInput = [(key[0], key[1], {'weight': val}) for key, val in Weights.items()]
    return WeightsInput
def GetNodeWeights(Connections):
    """
    Compute number of times the characters are mentioned.
    """
    unique_characters = list(Connections.values())
    unique_characters = [item for sublist in unique_characters for item in sublist]
    count = Counter(unique_characters)
    count = dict(sorted(count.items(), key=lambda item: item[1], reverse=True))
    return count
NodeWeights = GetNodeWeights(Connections)
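On a toy stand-in for `Connections`, the edge and weight computations in `GetEdges` and `GetLinkWeights` boil down to `itertools.combinations` plus a `Counter` over co-appearances:

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini version of Connections: chapter -> set of mentioned characters
toy = {
    'ch1': {'Harry', 'Ron', 'Hermione'},
    'ch2': {'Harry', 'Ron'},
}

# Undirected edges: every pair co-appearing in at least one chapter
edges = {tuple(sorted(pair)) for group in toy.values() for pair in combinations(group, 2)}

# Link weight: number of chapters in which both endpoints appear together
weights = Counter(tuple(sorted(pair)) for group in toy.values() for pair in combinations(group, 2))

print(sorted(edges))
print(weights[('Harry', 'Ron')])  # 2
```

Sorting each pair before hashing is what deduplicates `(A, B)` and `(B, A)` into a single undirected edge, exactly as in `GetEdges`.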
Split the data into the individual books as well, not only the entire book series.
# Unweighted
for i in range(1, 8):
    keys = [k for k in Connections if k[0] == i]
    globals()[f"UWC{i}"] = {your_key: Connections[your_key] for your_key in keys}
# Weighted
for i in range(1, 8):
    keys = [k for k in Connections_count if k[0] == i]
    globals()[f"WC{i}"] = {your_key: Connections_count[your_key] for your_key in keys}
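Assigning `UWC1`..`UWC7` via `globals()` works in a notebook, but a dictionary keyed by book number is an idiomatic alternative that keeps downstream code loopable. A sketch on toy keys:

```python
# Toy stand-in for Connections: keys are (book, chapter, title) tuples
connections = {
    (1, 1, 'A'): {'Harry'},
    (1, 2, 'B'): {'Ron'},
    (2, 1, 'C'): {'Hermione'},
}

# One sub-dict per book instead of seven separate global variables
by_book = {
    i: {k: v for k, v in connections.items() if k[0] == i}
    for i in range(1, 3)
}

print(len(by_book[1]))  # 2
```

With this layout, per-book statistics become a single loop over `by_book.items()` rather than seven near-identical cells.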
# Edges for the books individually and the book series
Edges = GetEdges(Connections)
print("# Edges in all books is", len(Edges), ". Thus, there are", len(Edges), "character relations mentioned in the summaries.")
# Get number of edges for each book
E1 = len(GetEdges(UWC1))
E2 = len(GetEdges(UWC2))
E3 = len(GetEdges(UWC3))
E4 = len(GetEdges(UWC4))
E5 = len(GetEdges(UWC5))
E6 = len(GetEdges(UWC6))
E7 = len(GetEdges(UWC7))
# Create a dataset
h = [E1, E2, E3, E4, E5, E6, E7]
b = ('Book1', 'Book2', 'Book3', 'Book4', 'Book5', 'Book6', 'Book7')
x_pos = np.arange(len(b))
# Create bars with different colors
plt.bar(x_pos, h, color=colors)
# Create names on the x-axis
plt.xticks(x_pos, b)
# Create title name
plt.title("Edges for each book individually")
# Show graph
plt.show()
# Edges in all books is 3283 . Thus, there are 3283 character relations mentioned in the summaries.
# Nodes for the books individually and the book series
Nodes = GetNodeWeights(Connections)
print("# Nodes in all books is", len(Nodes), ". Thus, there are", len(Nodes), "unique characters in the summaries.")
# Get number of nodes for each book
N1 = len(GetNodeWeights(UWC1))
N2 = len(GetNodeWeights(UWC2))
N3 = len(GetNodeWeights(UWC3))
N4 = len(GetNodeWeights(UWC4))
N5 = len(GetNodeWeights(UWC5))
N6 = len(GetNodeWeights(UWC6))
N7 = len(GetNodeWeights(UWC7))
# Create a dataset
h = [N1, N2, N3, N4, N5, N6, N7]
b = ('Book1', 'Book2', 'Book3', 'Book4', 'Book5', 'Book6', 'Book7')
x_pos = np.arange(len(b))
# Create bars with different colors
plt.bar(x_pos, h, color=colors)
# Create names on the x-axis
plt.xticks(x_pos, b)
# Create title name
plt.title("Nodes for each book individually")
# Show graph
plt.show()
# Nodes in all books is 183 . Thus, there are 183 unique characters in the summaries.
The aim of the network analysis is to locate communities in the dataset. Furthermore, the analysis should make it possible to identify the main characters of every book. The Louvain method for community detection is used to partition the characters into communities: it searches for the partition of all nodes that maximises the modularity, using the Louvain heuristics. Unweighted graphs for the book series as a whole and for the individual books, combined with Louvain communities, will be visualised and analysed. Based on these visualisations, we discuss whether the communities make sense in relation to the plot of every book. For us, it will be fun to see if the Louvain community detection supports a pattern that we, as fans, can recognise. Furthermore, modularity is used to measure the density of connections within a community. High modularity indicates dense connections between the nodes within communities but sparse connections between nodes in different communities. Modularity lies in the interval [-1.0, 1.0].
Weighted graphs for the book series as a whole and for the individual books will also be visualised and analysed. They make it possible to determine the main characters in the book(s). This requires the node weight of each character, which is computed by counting how many times each character is mentioned in total. From this, one should be able to detect the most important characters of the book(s) based on the mention count. To the surprise of absolutely no one, we firmly believe that Harry Potter comes out as the winner on that point ;)
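As a sanity check on `GraphModularity`, NetworkX's built-in `modularity` (from `networkx.algorithms.community`) can be evaluated on a toy graph with a known partition: two triangles joined by one bridge.

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Two cliques joined by a single bridge: a clearly modular toy graph
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (0, 2),   # clique A
                  (3, 4), (4, 5), (3, 5),   # clique B
                  (2, 3)])                  # bridge

M = modularity(G, [{0, 1, 2}, {3, 4, 5}])
print(round(M, 3))  # 0.357
```

The value 2 * (3/7 - (7/14)^2) = 5/14 ≈ 0.357 follows directly from the formula in the text: for each community, the fraction of internal links minus the squared fraction of total degree.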
def UnweightedGraph(Connections):
    """
    Uses NetworkX to build an unweighted graph.
    """
    # Compute links
    Edges = GetEdges(Connections)
    # Initialise
    G = nx.Graph()
    # Add nodes
    G.add_nodes_from(set().union(*Connections.values()), LouvainPartition=None, group=None)
    # Add edges to graph
    G.add_edges_from(Edges)
    return G
def GraphModularity(graph, p):
    """
    Computes modularity.
    """
    # Number of links in graph
    L = len(graph.edges())
    M = 0
    key = p[0]
    p = p[1]
    deg = {i: 0 for i in p}
    links = {i: 0 for i in p}
    # Loop through graph nodes
    for node in graph:
        par = graph.nodes[node][key]     # Node partition
        deg[par] += graph.degree[node]   # Node degree
        # Half-count internal links (each is visited from both endpoints)
        l = sum([w.get('weight', 1) / 2 for n, w in graph[node].items() if graph.nodes[node][key] == graph.nodes[n][key]])
        links[par] += l
    for par in p:
        M += links[par] / L - (deg[par] / (2 * L)) ** 2
    return M
def WeightedGraph(data, Connections, Nodes):
    """
    Uses NetworkX to build a weighted graph.
    """
    # Compute node weights
    node_weight = GetNodeWeights(Connections)
    # Find links
    Edges = GetEdges(Connections)
    # Compute link weights
    link_weight = GetLinkWeights(Edges, Connections)
    # Initialise weighted graph
    G_W = nx.Graph()
    # Add nodes
    G_W.add_nodes_from(set().union(*Connections.values()), LouvainPartition=None, group=None, size=None)
    # Add node size (weight)
    for key, val in node_weight.items():
        G_W.nodes[key]['size'] = val
    # Add links and weights
    G_W.add_edges_from(link_weight)
    return G_W
# Undirected graph
G_UW = UnweightedGraph(Connections)
# Compute the best partition
LouvainCommunities = community_louvain.best_partition(G_UW)
# Add this to graph
nx.set_node_attributes(G_UW, LouvainCommunities, 'group')
# Interactive graph
#UWGraphVisu, _ = nw.visualize(G_UW)
# Give communities an id
community_id = [LouvainCommunities[node] for node in G_UW.nodes()]
# Visualise graph
fig = plt.figure(figsize=(25,20))
nx.draw(G_UW,
edge_color = 'lightgrey',
cmap = plt.cm.PuRd,
node_color = community_id,
node_size = 500,
with_labels = True,
edgecolors = 'black')
# Compute modularity of Louvain partition
p = np.unique([i for i in LouvainCommunities.values()])
partition = ['group', p]
M = GraphModularity(G_UW, partition)
print("Modularity of unweighted graph is ", str(M))
G_W = WeightedGraph(characters_p, Connections_count, Nodes)
WGraphVisu, _ = nw.visualize(G_W)
# Undirected graph
G1_UW = UnweightedGraph(UWC1)
# Compute the best partition
LouvainCommunities = community_louvain.best_partition(G1_UW)
# Add this to graph
nx.set_node_attributes(G1_UW, LouvainCommunities, 'group')
# Interactive graph
#UWGraphVisu1, _ = nw.visualize(G1_UW)
# Give communities an id
community_id = [LouvainCommunities[node] for node in G1_UW.nodes()]
# Visualise graph
fig = plt.figure(figsize=(25,20))
nx.draw(G1_UW,
edge_color = 'lightgrey',
cmap = plt.cm.PuRd,
node_color = community_id,
node_size = 500,
with_labels = True,
edgecolors = 'black')
# Compute modularity of Louvain partition
p = np.unique([i for i in LouvainCommunities.values()])
partition = ['group', p]
M1 = GraphModularity(G1_UW, partition)
print("Modularity of unweighted graph is ", str(M1))
G1_W = WeightedGraph(characters_p, WC1, Nodes)
#WGraphVisu1, _ = nw.visualize(G1_W)
# Undirected graph
G2_UW = UnweightedGraph(UWC2)
# Compute the best partition
LouvainCommunities = community_louvain.best_partition(G2_UW)
# Add this to graph
nx.set_node_attributes(G2_UW, LouvainCommunities, 'group')
# Interactive graph
#UWGraphVisu2, _ = nw.visualize(G2_UW)
# Give communities an id
community_id = [LouvainCommunities[node] for node in G2_UW.nodes()]
# Visualise graph
fig = plt.figure(figsize=(25,20))
nx.draw(G2_UW,
edge_color = 'lightgrey',
cmap = plt.cm.PuRd,
node_color = community_id,
node_size = 500,
with_labels = True,
edgecolors = 'black')
# Compute modularity of Louvain partition
p = np.unique([i for i in LouvainCommunities.values()])
partition = ['group', p]
M2 = GraphModularity(G2_UW, partition)
print("Modularity of unweighted graph is ", str(M2))
G2_W = WeightedGraph(characters_p, WC2, Nodes)
#WGraphVisu2, _ = nw.visualize(G2_W)
# Undirected graph
G3_UW = UnweightedGraph(UWC3)
# Compute the best partition
LouvainCommunities = community_louvain.best_partition(G3_UW)
# Add this to graph
nx.set_node_attributes(G3_UW, LouvainCommunities, 'group')
# Interactive graph
#UWGraphVisu3, _ = nw.visualize(G3_UW)
# Give communities an id
community_id = [LouvainCommunities[node] for node in G3_UW.nodes()]
# Visualise graph
fig = plt.figure(figsize=(25,20))
nx.draw(G3_UW,
edge_color = 'lightgrey',
cmap = plt.cm.PuRd,
node_color = community_id,
node_size = 500,
with_labels = True,
edgecolors = 'black')
# Compute modularity of Louvain partition
p = np.unique([i for i in LouvainCommunities.values()])
partition = ['group', p]
M3 = GraphModularity(G3_UW, partition)
print("Modularity of unweighted graph is ", str(M3))
G3_W = WeightedGraph(characters_p, WC3, Nodes)
#WGraphVisu3, _ = nw.visualize(G3_W)
# Undirected graph
G4_UW = UnweightedGraph(UWC4)
# Compute the best partition
LouvainCommunities = community_louvain.best_partition(G4_UW)
# Add this to graph
nx.set_node_attributes(G4_UW, LouvainCommunities, 'group')
# Interactive graph
#UWGraphVisu4, _ = nw.visualize(G4_UW)
# Give communities an id
community_id = [LouvainCommunities[node] for node in G4_UW.nodes()]
# Visualise graph
fig = plt.figure(figsize=(25,20))
nx.draw(G4_UW,
edge_color = 'lightgrey',
cmap = plt.cm.PuRd,
node_color = community_id,
node_size = 500,
with_labels = True,
edgecolors = 'black')
# Compute modularity of Louvain partition
p = np.unique([i for i in LouvainCommunities.values()])
partition = ['group', p]
M4 = GraphModularity(G4_UW, partition)
print("Modularity of unweighted graph is ", str(M4))
G4_W = WeightedGraph(characters_p, WC4, Nodes)
#WGraphVisu4, _ = nw.visualize(G4_W)
# Undirected graph
G5_UW = UnweightedGraph(UWC5)
# Compute the best partition
LouvainCommunities = community_louvain.best_partition(G5_UW)
# Add this to graph
nx.set_node_attributes(G5_UW, LouvainCommunities, 'group')
# Interactive graph
#UWGraphVisu5, _ = nw.visualize(G5_UW)
# Give communities an id
community_id = [LouvainCommunities[node] for node in G5_UW.nodes()]
# Visualise graph
fig = plt.figure(figsize=(25,20))
nx.draw(G5_UW,
edge_color = 'lightgrey',
cmap = plt.cm.PuRd,
node_color = community_id,
node_size = 500,
with_labels = True,
edgecolors = 'black')
# Compute modularity of Louvain partition
p = np.unique([i for i in LouvainCommunities.values()])
partition = ['group', p]
M5 = GraphModularity(G5_UW, partition)
print("Modularity of unweighted graph is ", str(M5))
G5_W = WeightedGraph(characters_p, WC5, Nodes)
#WGraphVisu5, _ = nw.visualize(G5_W)
# Undirected graph
G6_UW = UnweightedGraph(UWC6)
# Compute the best partition
LouvainCommunities = community_louvain.best_partition(G6_UW)
# Add this to graph
nx.set_node_attributes(G6_UW, LouvainCommunities, 'group')
# Interactive graph
#UWGraphVisu6, _ = nw.visualize(G6_UW)
# Give communities an id
community_id = [LouvainCommunities[node] for node in G6_UW.nodes()]
# Visualise graph
fig = plt.figure(figsize=(25,20))
nx.draw(G6_UW,
edge_color = 'lightgrey',
cmap = plt.cm.PuRd,
node_color = community_id,
node_size = 500,
with_labels = True,
edgecolors = 'black')
# Compute modularity of Louvain partition
p = np.unique([i for i in LouvainCommunities.values()])
partition = ['group', p]
M6 = GraphModularity(G6_UW, partition)
print("Modularity of unweighted graph is ", str(M6))
G6_W = WeightedGraph(characters_p, WC6, Nodes)
#WGraphVisu6, _ = nw.visualize(G6_W)
# Undirected graph
G7_UW = UnweightedGraph(UWC7)
# Compute the best partition
LouvainCommunities = community_louvain.best_partition(G7_UW)
# Add this to graph
nx.set_node_attributes(G7_UW, LouvainCommunities, 'group')
# Interactive graph
#UWGraphVisu7, _ = nw.visualize(G7_UW)
# Give communities an id
community_id = [LouvainCommunities[node] for node in G7_UW.nodes()]
# Visualise graph
fig = plt.figure(figsize=(25,20))
nx.draw(G7_UW,
edge_color = 'lightgrey',
cmap = plt.cm.PuRd,
node_color = community_id,
node_size = 500,
with_labels = True,
edgecolors = 'black')
# Compute modularity of Louvain partition
p = np.unique([i for i in LouvainCommunities.values()])
partition = ['group', p]
M7 = GraphModularity(G7_UW, partition)
print("Modularity of unweighted graph is ", str(M7))
G7_W = WeightedGraph(characters_p, WC7, Nodes)
#WGraphVisu7, _ = nw.visualize(G7_W)
# Modularity plots
# Create a dataset
h = [M, M1, M2, M3, M4, M5, M6, M7]
b = ('All', 'Book1', 'Book2', 'Book3', 'Book4', 'Book5', 'Book6', 'Book7')
x_pos = np.arange(len(b))
# Create bars with different colors
# Add color for entire book to our color scheme
special_colors = ['grey', 'red', 'green', 'magenta', 'blue', 'purple', 'cyan', 'orange']
plt.bar(x_pos, h, color=special_colors)
# Create names on the x-axis
plt.xticks(x_pos, b)
# Create title name
plt.title("Modularity for the entire book series and each book individually")
# Show graph
plt.show()
People clearly interact with each other across communities, which this plot supports: the modularity values are only moderate. In Books 5-7, as the battle between Good (Harry) and Evil (Voldemort) becomes clearer, so does the tendency for characters to interact more closely within their own community.
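For intuition about these modularity values, the quantity can be computed from first principles. The sketch below is a hypothetical helper (not the notebook's `GraphModularity`): it implements Newman's Q = sum over communities c of (l_c/m - (d_c/2m)^2), where l_c is the number of intra-community edges, d_c the total degree of community c, and m the total number of edges.

```python
def modularity(edges, community):
    """Newman modularity Q = sum_c (l_c/m - (d_c/(2m))^2) for an undirected graph."""
    m = len(edges)  # total number of edges
    l, d = {}, {}   # intra-community edge counts and community degree sums
    for u, v in edges:
        cu, cv = community[u], community[v]
        d[cu] = d.get(cu, 0) + 1
        d[cv] = d.get(cv, 0) + 1
        if cu == cv:
            l[cu] = l.get(cu, 0) + 1
    return sum(l.get(c, 0) / m - (d.get(c, 0) / (2 * m)) ** 2
               for c in set(community.values()))

# Two triangles joined by one bridge edge: a clear two-community split
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(round(modularity(edges, community), 4))  # 0.3571
```

Values near 0 mean the partition is no better than random; values towards 1 mean densely intra-connected communities, which puts the moderate scores in the bar chart in perspective.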
WordClouds help us identify the most distinctive and important topics in the book series as a whole and in each individual book.
Process and method:
# Make tokens lower case
lower_tokens = []
for index, row in df.iterrows():
lower_tokens.append([token.lower() for token in row['plot_tokens']])
df['tokens_lower'] = lower_tokens
df
# Create WordCloud for the entire book series
full_text = ""
for index, row in df.iterrows():
full_text += row['summary'].lower() + " "
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
min_font_size = 10).generate(full_text)
# Plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
# Make dictionary with tokens as list
docs = {i: {'tokens':[]} for i in range(1,8)}
for index, row in df.iterrows():
docs[row['book']]['tokens'].extend(row['tokens_lower'])
# Make each list into a nltk Text
for book in docs.keys():
docs[book]['text'] = nltk.Text(docs[book]['tokens'])
# For each book
for book in docs.keys():
# Create document from text
doc = docs[book]['text']
# Frequency distribution
fdist = nltk.FreqDist(doc)
# Find unique words in text
docs[book]['unique_words'] = [word for word in fdist]
# Find term frequency
docs[book]['TF'] = [fdist[word]/len(doc) for word in docs[book]['unique_words']]
# Connect term frequency with the word in dictionary
docs[book]['TF_word'] = Counter({word: docs[book]['TF'][i] for i, word in enumerate(docs[book]['unique_words'])})
# How many of the documents does each word appear in?
cumulative_unique = []
for book in docs.keys():
cumulative_unique.extend(docs[book]['unique_words'])
# Make into nltk Text
cumulative_unique = nltk.Text(cumulative_unique)
# Frequency distribution
fdist_allwords = nltk.FreqDist(cumulative_unique)
N = 7 #Number of books
# For each book
for book in docs.keys():
# Inverse document frequency (IDF)
docs[book]['IDF'] = [np.log10(N/fdist_allwords[word]) for word in docs[book]['unique_words']]
# Term frequency-inverse document frequency (TF-IDF)
docs[book]['TF-IDF'] = [docs[book]['TF'][i]*docs[book]['IDF'][i] for i in range(len(docs[book]['unique_words']))]
# Connect term frequency-inverse document frequency with the word in a dictionary
docs[book]['TF-IDF_word'] = Counter({word: docs[book]['TF-IDF'][i] for i, word in enumerate(docs[book]['unique_words'])})
# Create a WordCloud per book based on term frequency-inverse document frequency (TF-IDF)
for book in docs.keys():
wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
min_font_size = 10).generate_from_frequencies(docs[book]['TF-IDF_word'])
# Plot the WordCloud
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
Sentiment analysis will help us test our theory on whether the books become more and more dark, sinister, and gloomy - both in relation to book depth and the book series as a whole. We will use the labMT happiness scores from https://hedonometer.org/words/labMT-en-v2/ (downloaded as a .csv file) to compute a happiness score for each chapter.
Process and method:
# Get hedonometer happiness scores
df_hedonometer = pd.read_csv('Hedonometer.csv', index_col=0)
# Alter score to be from -5 to 5 instead of 0 to 10
altered_happiness = [score-5 for score in df_hedonometer['Happiness Score']]
df_hedonometer['Happiness Score'] = altered_happiness
df_hedonometer
The happiness score is adjusted to be from -5 (most unhappy) to 5 (most happy) with 0 being the neutral middle.
# Make a happiness dictionary from dataframe to use for lookup
happiness_score_dict = {row.Word: row['Happiness Score'] for index, row in df_hedonometer.iterrows()}
def GetHappinessScore(tokens):
"""
Takes a list of token
Returns the mean and standard deviation of the happiness score of the text
"""
# Get list of scores for all the words from the list that exist in the labmt dataset
happiness_score = [happiness_score_dict[token] for token in tokens if token in happiness_score_dict.keys()]
# return mean if there are any words with a score, else return nan
return np.mean(happiness_score) if bool(happiness_score) else np.nan , np.std(happiness_score) if bool(happiness_score) else np.nan
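Since `Hedonometer.csv` is not reproduced here, a self-contained usage sketch with a toy score dictionary (illustrative values on the adjusted -5 to 5 scale, not real labMT entries):

```python
from statistics import mean, pstdev

toy_scores = {'happy': 3.2, 'death': -3.5, 'wand': 0.4}  # illustrative, not real labMT values

def get_happiness_score(tokens, score_dict):
    """Mean and (population) std of the scores of tokens found in the dictionary; NaN if none match."""
    scores = [score_dict[t] for t in tokens if t in score_dict]
    if not scores:
        return float('nan'), float('nan')
    return mean(scores), pstdev(scores)

m, s = get_happiness_score(['the', 'happy', 'death', 'wand'], toy_scores)
print(round(m, 4))  # 0.0333: 'the' is skipped, so the mean is (3.2 - 3.5 + 0.4) / 3
```

Note how a single strong word ('death') dominates the mean: with only a handful of scored tokens per chapter, individual words carry a lot of weight.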
#Get happiness score for all chapters
remove_stopwords = True # Remove stopwords from token list
# List for overall score each chapter
chapter_happiness = []
# List for standard deviation of happiness score each chapter
chapter_happiness_std = []
# List for lower case tokens
all_tokens_lower = []
for index, row in df.iterrows():
# Lower case tokens and (optionally) remove stopwords; compare the lower-cased token against the stopword list
chapter_tokens = [token.lower() for token in row['plot_tokens'] if not (remove_stopwords and token.lower() in stopwords.words('english'))]
all_tokens_lower.append(chapter_tokens)
# Compute chapter happiness scores and standard deviation
score, std = GetHappinessScore(chapter_tokens)
# Append happiness score
chapter_happiness.append(score)
# Append its standard deviation
chapter_happiness_std.append(std)
# Add to dataframe
df['happiness'] = chapter_happiness
df['happiness_std'] = chapter_happiness_std
df['tokens_cleaned'] = all_tokens_lower
df
We plot the chapters as time on the x-axis and the mean happiness score on the y-axis. This shows the development as the series progresses. The trendline is a linear fit. Interesting peaks and valleys are highlighted with a short description of the chapter.
The plot is created with and without standard deviation for comparison.
# Setup for plots
def setup_mpl():
mpl.rcParams['font.size']=5
mpl.rcParams['figure.figsize']=(4.5,2.5)
mpl.rcParams['figure.dpi']=200
setup_mpl()
def plotSeriesHappiness(std, df):
# Ongoing chapter number
ch = [i for i in range(1,len(df)+1)]
df['chapter_in_series'] = ch
# Make trendline
z = np.polyfit(df['chapter_in_series'], df['happiness'], 1)
p = np.poly1d(z)
# Plot
fig, ax = plt.subplots()
for i in range(1,8):
dfbook = df[df['book']==i]
ax.plot(dfbook['chapter_in_series'], dfbook['happiness'], color=colors[i-1], label=f"book {i}")
if std:
ax.fill_between(dfbook['chapter_in_series'], dfbook['happiness']+dfbook['happiness_std'], dfbook['happiness']-dfbook['happiness_std'], facecolor=colors[i-1], alpha=0.2)
ax.plot(df['chapter_in_series'],p(df['chapter_in_series']),"k--", label="Trend line")
# Chapters that could be of interest to look at
ax.text(42,df['happiness'][41], "Boggart class", size='small')
ax.plot(42,df['happiness'][41],'r.')
ax.text(64,df['happiness'][63], "Quidditch world cup", size='small')
ax.plot(64,df['happiness'][63],'r.')
ax.text(89,df['happiness'][88], "Cedric killed", size='small')
ax.plot(89,df['happiness'][88],'r.')
ax.text(116,df['happiness'][115], "Arthur in hospital after snake attack", size='small')
ax.plot(116,df['happiness'][115],'r.')
ax.text(119,df['happiness'][118], "People start believing Harry", size='small')
ax.plot(119,df['happiness'][118],'r.')
ax.text(129,df['happiness'][128], "Sirius killed by Bellatrix", size='small')
ax.plot(129,df['happiness'][128],'r.')
ax.text(148,df['happiness'][147], "Christmas at the burrow", size='small') # fake positive - read plot
ax.plot(148,df['happiness'][147],'r.')
ax.text(169,df['happiness'][168], "Harry's 17th birthday, kisses Ginny", size='small')
ax.plot(169,df['happiness'][168],'r.')
ax.text(196,df['happiness'][195], "Harry goes to die", size='small')
ax.plot(196,df['happiness'][195],'r.')
ax.text(161,df['happiness'][160], "Hospital wing after battle with Death Eaters", size='small')
ax.plot(161,df['happiness'][160],'r.')
ax.legend(loc="lower left", ncol=3)
ax.set_xlabel("Chapter in series")
ax.set_ylabel("Mean happiness score")
if std:
ax.set_title("Series happiness with Standard deviation")
else:
ax.set_title("Series happiness")
plt.show()
plotSeriesHappiness(False, df)
It looks like most chapters score a little happier than neutral (a score of 0), and our hypothesis that the series gets less happy as time goes on is supported. The highlighted peaks and valleys are explored further below.
What happens when we look at the standard deviation of the happiness score?
plotSeriesHappiness(True, df)
# How many tokens do we have in each chapter
print(f"Shortest chapter summary has {min([len(ch) for ch in df['tokens_cleaned']])} tokens. Longest chapter summary has {max([len(ch) for ch in df['tokens_cleaned']])} tokens")
The labMT dataset was created for and originally used on Twitter data, i.e. on a much bigger dataset. Our cleaned chapter plot summaries are very short texts of 30-621 words. This is what causes the very large standard deviations on the mean happiness scores: a few individual words can have an outsized effect on the score of a chapter. Looking at this plot, there isn't much to say about the peaks, valleys or trends. Almost every chapter has a happiness score of 0.5 plus/minus 1.
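The noise argument can be made concrete: the standard error of a chapter's mean scales as sigma divided by the square root of the number of scored words. A quick sketch, assuming (for illustration only) a word-level standard deviation of sigma = 1:

```python
import math

sigma = 1.0  # assumed word-level std of happiness scores (illustrative, not fitted to labMT)
# Standard error of a chapter's mean score scales as sigma / sqrt(n)
for n in (30, 621):
    print(n, round(sigma / math.sqrt(n), 3))
# The shortest summaries are therefore ~4.5x noisier than the longest ones
print(round(math.sqrt(621 / 30), 1))  # 4.5
```

This is why the shaded bands swamp the trend: the chapter means simply aren't estimated precisely enough from 30-621 words.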
We want to see if the within-book happiness trend is similar to the full-series trend. These scores are of course still subject to the standard deviations shown above.
# Each book individually
fig, ax = plt.subplots()
for i in range(1,8):
dfbook = df[df['book']==i]
ax.plot(dfbook['chapter_number'], dfbook['happiness'], color=colors[i-1], label=f"book {i}")
ax.legend(loc="upper right", ncol=2)
ax.set_xlabel("Chapter in book")
ax.set_ylabel("Mean happiness score")
ax.set_title("Chapter happiness")
plt.show()
This is rather messy, so we look at the trendlines.
fig, ax = plt.subplots()
for i in range(1,8):
dfbook = df[df['book']==i]
# use the chapter number within the book directly (not yet normalised)
x = dfbook['chapter_number']
# make trendline
z = np.polyfit(x, dfbook['happiness'], 1)
p = np.poly1d(z)
ax.plot(x, p(x), color=colors[i-1], label=f"book {i}")
ax.legend(loc="lower left", ncol=3)
ax.set_xlabel("Chapter")
ax.set_ylabel("Mean happiness score")
ax.set_title("Book Happiness - Trendlines")
plt.show()
The trendlines show us that all the books have a negative happiness trend. This supports our hypothesis, but only weakly, as the actual difference in happiness score from beginning to end is very small. The books, however, have different numbers of chapters, so below we normalise the x-axis for a better comparison.
fig, ax = plt.subplots()
for i in range(1,8):
dfbook = df[df['book']==i]
# normalise from number of chapters to [0,1]
x = np.arange(0,1,1/len(dfbook))
# make trendline
z = np.polyfit(x, dfbook['happiness'], 1)
p = np.poly1d(z)
ax.plot(x, p(x), color=colors[i-1], label=f"book {i}")
ax.legend(loc="lower left", ncol=3)
ax.set_xlabel("Depth in book")
ax.set_ylabel("Mean happiness score")
ax.set_title("Book Happiness - Trendlines")
plt.show()
Now the books are comparable, and we see that book 5 appears to have the steepest decline while book 7 starts out the least happy. We know from above that the mean scores can seem to tell us more than they actually do, so we add the standard deviations back in as before.
fig, ax = plt.subplots()
for i in range(1,8):
dfbook = df[df['book']==i]
x = np.arange(0,1,1/len(dfbook))
# make trendline
z = np.polyfit(x, dfbook['happiness'], 1)
p = np.poly1d(z)
ax.plot(x, p(x), color=colors[i-1], label=f"book {i}")
ax.fill_between(x, dfbook['happiness']+dfbook['happiness_std'], dfbook['happiness']-dfbook['happiness_std'], facecolor=colors[i-1], alpha=0.2)
ax.legend(loc="lower left", ncol=3)
ax.set_xlabel("Depth in book")
ax.set_ylabel("Mean happiness score")
ax.set_title("Book Happiness - Trendlines with Standard Deviation")
plt.show()
This final plot tells us once again that the small number of words in each chapter makes for a large standard deviation. Really, with this data there may not be all that much we can conclude from the happiness-score sentiment analysis.
# a function that computes and plots the word shifts
def createWordShift(df, d):
# d is the chapters full-series number minus 1 (since the dataframe is zero-indexed)
# get the clean tokens for the chapter of interest
l = df['tokens_cleaned'][d]
# get clean tokens for the 3 chapters preceding chapter d
l_ref = []
for i in range(3):
l_ref.extend(df['tokens_cleaned'][d-3:d].values[i])
# compute relative frequency for each token in list l
p = dict([(item[0], item[1]/len(l)) for item in Counter(l).items()])
# compute relative frequency for each token in list l_ref
p_ref = dict([(item[0], item[1]/len(l_ref)) for item in Counter(l_ref).items()])
# get set of tokens
all_tokens = set(p.keys()).union(set(p_ref.keys()))
# compute difference between p_ref and p
delta_p = dict([(token, p.get(token,0) - p_ref.get(token,0)) for token in all_tokens])
# compute happiness score for each token
h = dict([(token, happiness_score_dict.get(token, np.nan)) for token in all_tokens])
# compute happiness times difference in relative frequency for each token (for inspection; shifterator recomputes these internally)
d_phi = [(token, h[token]*delta_p[token]) for token in all_tokens if not np.isnan(h[token])]
# plot wordshifts using shifterator
sentiment_sh = sh.WeightedAvgShift(type2freq_1=p_ref,
type2freq_2=p,
type2score_1=happiness_score_dict,
reference_value=0)
sentiment_sh.get_shift_graph(detailed=True,
system_names = ['Previous 3 chapters',f'Book {df.book[d]}, chapter {df.chapter_number[d]}'])
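For reference, the per-word contributions that shifterator visualises can be reproduced by hand: each word shifts the happiness difference by h_w times (p_w minus p_ref_w), i.e. its score weighted by its change in relative frequency. A minimal sketch with a hypothetical mini score dictionary (illustrative values):

```python
from collections import Counter

toy_scores = {'kill': -4.0, 'party': 3.0, 'wand': 0.5}  # illustrative values

def word_shift(tokens, ref_tokens, scores):
    """Per-word contribution h_w * (p_w - p_ref_w), sorted by magnitude."""
    p, p_ref = Counter(tokens), Counter(ref_tokens)
    n, n_ref = len(tokens), len(ref_tokens)
    contrib = {w: scores[w] * (p[w] / n - p_ref[w] / n_ref)
               for w in set(p) | set(p_ref) if w in scores}
    return sorted(contrib.items(), key=lambda kv: -abs(kv[1]))

shifts = word_shift(['kill', 'kill', 'wand', 'dark'],
                    ['party', 'party', 'wand', 'cake'],
                    toy_scores)
print(shifts[0])  # ('kill', -2.0): strongly negative and more frequent than in the reference
```

A word contributes nothing if its relative frequency is unchanged (here 'wand'), which is why the shift graphs only highlight words whose usage differs between the chapter and its reference.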
The word shifts for these selected chapters show which words cause them to stand out as particularly positive or negative compared to the previous 3 chapters. We have picked 4 chapters that stand out in the series-happiness plot above to look at a little closer.
chapters = [88, 128, 147, 168]
chapter_string = ["Cedric killed", "Sirius killed by Bellatrix", "Christmas at the burrow", "Harry's 17th birthday, kisses Ginny"]
for idx, chapter in enumerate(chapters):
print(chapter_string[idx])
createWordShift(df, chapter)
The first two chapters have rather negative shifts. This is expected, as a good character is killed by evil forces. Words like "kill", "grave", "horror", "dead", "defeated", and "battle" play a large role in these negative shifts.
The last two chapters have positive shifts. These are about a Christmas and a birthday. Words like "Christmas", "holidays", "birthday", "wedding", and "kisses" pull these chapters in the positive direction.
The labMT dataset is not made for Harry Potter, so some words that take on a specific meaning in the wizarding world have more colloquial meanings in general English. An example is "lord", which in Harry Potter always refers to Lord Voldemort and could certainly be categorised as unhappy, yet appears in the first highlighted chapter as a very positive word, probably because of its religious associations. The reverse is true for "snitch": it has a rather negative labMT happiness score, but in Harry Potter it refers to a ball in quidditch and is not negative at all.
Another thing worth noting is that much like the happiness plots above, these word shifts are affected by the very small amount of words in each chapter. Individual words can have an outsized effect.
The method also has its limitations. It cannot understand context and sometimes seemingly positive words can be put together to form obviously negative sentences. The Christmas chapter we have picked out here is an example of this. Some of the positive words like "boost", "innocent" and "ministry" are actually not that positive when you get them in context - see the summary below.
df.summary[147]
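A toy illustration of this limitation (hypothetical mini-dictionary, not real labMT scores): a bag-of-words scorer assigns a sentence and its negation the same score, because word order and context are discarded.

```python
toy_scores = {'innocent': 3.0, 'guilty': -3.0}  # illustrative values

def bag_of_words_score(tokens, scores):
    """Mean score of the tokens that appear in the dictionary (context-blind)."""
    found = [scores[t] for t in tokens if t in scores]
    return sum(found) / len(found) if found else float('nan')

a = bag_of_words_score('he was proven innocent'.split(), toy_scores)
b = bag_of_words_score('he was not proven innocent'.split(), toy_scores)
print(a, b, a == b)  # 3.0 3.0 True: the negation is invisible to the scorer
```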
We chose a topic we are all very interested in which made it super fun to work with. It made it easy to see patterns and made us able to make an extensive analysis due to our domain knowledge.
We found it difficult to work with the website and insert dynamic visualizations, interactive analysis, etc. as this was not introduced in class.
The drawback of community partition methods is that they have a tendency to let small clusters be absorbed by larger ones. This makes our community analysis less nuanced.
We worked with summaries and not the full texts. A rule of thumb is that more data leads to more information. We noticed that only 183 of 403 characters are mentioned in the summaries, which influences the networks. Yet, one has to assume that characters who are not mentioned in the summaries are not important and thus irrelevant. Using the book chapters instead of summaries would provide more words, which could have resulted in a more informative sentiment analysis. A work of fiction might also use more expressive words than a summary, which again might have affected the sentiment analysis.
The sentiment analysis could have benefitted from a better tokenisation/stemming process, which would have made more words available for happiness score calculations. This might have reduced the uncertainty a bit and shrunk the large standard deviations.
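To illustrate the coverage point: labMT mostly stores base forms, so unmatched inflected tokens could be retried after suffix stripping. The naive fallback below is a crude stand-in for a proper stemmer (e.g. nltk's PorterStemmer); the dictionary is hypothetical.

```python
toy_scores = {'kill': -4.0, 'laugh': 3.5}  # illustrative values

def lookup_with_fallback(token, scores):
    """Try the token as-is, then with common English suffixes stripped (naive)."""
    if token in scores:
        return scores[token]
    for suffix in ('ed', 'ing', 's'):
        if token.endswith(suffix) and token[:-len(suffix)] in scores:
            return scores[token[:-len(suffix)]]
    return None  # still no match

print(lookup_with_fallback('killed', toy_scores))    # -4.0, matched after stripping 'ed'
print(lookup_with_fallback('laughing', toy_scores))  # 3.5, matched after stripping 'ing'
```

A real implementation would stem both the tokens and the labMT words consistently; stemming only one side can create mismatches of its own.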
Data from